Nothing new under the sun: Mediaeval line plot, circa 1010. Image: wikimedia
Before you start
- Make sure you complete the worksheet on exploring data and plotting
- Remember to load the tidyverse
library(tidyverse)
Overview
The workshop is intended to follow the second lecture of PSYC753.
In the session we watched Hans Rosling’s “200 countries and 200 years in 4 minutes”, which we (hopefully) agreed is something to aspire to.
Combined with his enthusiastic presentation, the visualisations in this clip support a clear narrative and help us understand this complex dataset.
The plot he builds is interesting because it uses many different visual features to express variables in the data:
- X and Y axes
- Size of the points
- Colour
- Time (in the animation)
These features are carefully selected to highlight important features of the data and support the narrative he provides.
Although we need to have integrity in our plotting (we’ll see bad examples later), this narrative aspect of a plot is important: we always need to consider our audience.
Multi-dimensional plotting
This sounds fancy, but as we saw it just means linking different visual or perceptual features of a plot to different variables in the data.
We can determine how complex a plot is by how many dimensions it has.
For example, this plot has one dimension representing how revolting particular fruits and vegetables are:
Mango Pear Aubergine Snozzcumber
| | | |
—————————————————————————————————————————————————————————————————
Scatter plots are more complex because they have two dimensions — that is, they show two variables at once. The variables are represented by the position of each point on the X and Y axes:
Life expectancy and GDP per capita in countries around the world in 2002
And Rosling’s plot is more complex still because it adds dimensions of colour and size, and uses a special logarithmic scale for the x-axis (more on this later).
Defining dimensions/aesthetics in ggplot
As you have already seen, ggplot uses the term aesthetics to refer to different dimensions of a plot. ‘Aesthetics’ refers to ‘what things look like’, and the aes() command in ggplot links variables (columns in the dataset) to visual features of the plot. This is called a mapping.
There are several visual features (aesthetics) of plots we will use in this session:
- x and y axes
- colour
- size (of a point, or thickness of a line)
- shape (of points)
- linetype (i.e. dotted/patterned or solid)
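As a rough sketch of what a mapping looks like in practice, the code below links five columns of the built-in mtcars data to five aesthetics. The choice of columns is only an illustration (an assumption made for this example, not part of the exercises), and ggplot2 and dplyr are loaded directly here although library(tidyverse) loads both.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

# Each argument inside aes() links one column of mtcars to one visual feature
p_aes <- mtcars %>%
  ggplot(aes(x = wt,                  # weight on the x axis
             y = mpg,                 # fuel economy on the y axis
             colour = hp,             # continuous variable, so a colour gradient
             size = qsec,             # point size
             shape = factor(gear))) + # shape needs a categorical variable
  geom_point()

p_aes
```

Running layer_data(p_aes) shows the values ggplot computed for each aesthetic, one row per point, which can help when a mapping does not look the way you expected.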
Recreate the Rosling plot
Rosling’s plot looked something like this:
To create a (slightly simplified) version of the plot above, the code would look something like this:
gapminder::gapminder %>%
  filter(BLANK==BLANK) %>%
  ggplot(aes(x=BLANK,
             y=BLANK,
             size=BLANK,
             color=BLANK)) +
  geom_point()
I have removed some parts of the code. Your job is to edit the parts which say BLANK and replace them with the names of variables from the gapminder::gapminder dataset (you don’t need to load this; it’s part of the gapminder package).
Some hints:
- The dataset is called gapminder::gapminder and you need to write that in full (with the colons)
- Check the title of the figure above to work out which rows of the data you need to plot (and so define the filter)
- The BLANKs represent variable names in the dataset (the second BLANK in the filter is the value to match). You can see a list of the column names available by typing names(gapminder::gapminder)
- If you are confused by the filter(BLANK==BLANK), check the title of the plot above. What data do we need to select for this plot? Remember that filter selects particular rows from the data, so we can use it to restrict what is shown.
Summary so far
- Plots can have multiple dimensions; that is, they can display several variables at once
- Colour, shape, size and line-type are common ways of displaying dimensions
- In ggplot, these visual features are called aesthetics
- The mapping between visual features and variables is created using the aes() command
Using multiple layers
When visualising data, there’s always more than one way to do things. As well as plotting different dimensions, different types of plot can highlight different features of the data. In ggplot, different types of plots are called geometries. Multiple layers can be combined in the same plot by adding together commands which have the prefix geom_.
As we have already seen, we can use geom_point(...) to create a scatter plot:
Life expectancy and GDP in Asia
To add additional layers to this plot, we can add extra geom_<NAME> functions. For example, geom_smooth overlays a smooth line on any x/y plot:
gapminder::gapminder %>%
  filter(continent=="Asia") %>%
  ggplot(aes(lifeExp, gdpPercap)) +
  geom_point() +
  geom_smooth()
Explanation of the command: We added + geom_smooth() to our previous plot. This means we now have two geometries added to the same plot: geom_point and geom_smooth.
Explanation of the output: If you run the command above you will see a message along the lines of geom_smooth() using method = 'gam' and formula 'y ~ s(x, bs = "cs")'. You can ignore this for the moment. The plot shown is the same as the scatterplot before, but now has a smooth blue line overlaid. This represents the local average of GDP, for each level of lifeExp. There is also a grey-shaded area, which represents the uncertainty around the local average (by default a 95% confidence interval, calculated from the standard error; again there will be more on this later).
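If you would rather choose the smoother yourself, you can name it explicitly, which also stops ggplot printing the message. The sketch below is an illustration only: it uses the built-in faithful dataset (an assumption for this example, it is not used elsewhere in this worksheet) and asks for a straight line with method = "lm" instead of the default curve.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

# faithful is built into R: eruption length and waiting time for a geyser
p_line <- faithful %>%
  ggplot(aes(eruptions, waiting)) +
  geom_point() +
  # Naming the method and formula means ggplot does not have to guess
  geom_smooth(method = "lm", formula = y ~ x)

p_line
```

Swapping method = "lm" for method = "loess" gives a flexible local average again, so the two can be compared on the same data.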
Task: Make a smoothed-line plot
- Reopen the cps2.csv data, or use the mtcars or iris data. Create a scatter plot of any two continuous variables.
- Add a smoothed line to the plot using geom_smooth
- Add the colour or size aesthetics to the plot above (so using a third variable)
Summary of this section
- ggplot doesn’t restrict you to a single view of the data
- Plots can have multiple layers, presenting the same data in different ways
- Each layer is called a ‘geometry’, and the functions to add layers all start with geom_
- Smoothed-line plots show the local average
Facets
As we add layers our plots become more complex. We run into trade-offs between information density and clarity.
To give one example, this plot shows life expectancies for each country in the gapminder data, plotted by year:
gapminder::gapminder %>%
  ggplot(aes(year, lifeExp, group=country)) +
  geom_smooth(se=FALSE)
Explanation: This is another x/y plot. However this time we have not added points, but rather smoothed lines (one for each country).
Explanation of the code: We have created an x/y plot as before, but this time we only added geom_smooth (and not geom_point), so we can’t see the individual datapoints. We have also added the text group=country which means we see one line per country in the dataset. Finally, we also added se=FALSE which hides the shaded area that geom_smooth adds by default to show the uncertainty around the line.
Comment on the result: It’s pretty hard to read!
To increase the information density, and explore patterns within the data, we might add another dimension and aesthetic. The next plot colours the lines by continent:
gapminder::gapminder %>%
  ggplot(aes(year, lifeExp, colour=continent, group=country)) +
  geom_smooth(se=FALSE)
However, even with colours added it’s still a bit of a mess. We can’t see the differences between continents easily. To clean things up we can use a technique called facetting:
gapminder::gapminder %>%
  ggplot(aes(year, lifeExp, group=country)) +
  geom_smooth(se=FALSE) +
  facet_grid(~continent)
Explanation: We added the text + facet_grid(~continent) to our earlier plot, but removed the part that said colour=continent. This made ggplot create individual panels for each continent. Splitting the graph this way makes it somewhat easier to compare the differences between continents.
Task: Use facetting
Use the iris dataset which is built into R.
- Try to recreate this plot, by adapting the code from the example above:
- Create a new plot which uses colours to distinguish species and does not use facets
In this example, which plot do you prefer? What influences when facets are more useful than just using colour?
There’s no right answer here, but for this example I prefer the coloured plot to the facetted one. The reason is that there are only 3 species in this dataset, and the points for each don’t overlap much. This means it is easy to distinguish them, even in the combined plot. But, if there were many different species it might be helpful to use facets instead.
Our decisions should be driven by what we are trying to communicate with the plot. What was the research question that motivated us to draw it?
- Try replacing facet_grid(~continent) with facet_grid(continent~.). What happens?
- With the gapminder example from above, try replacing facet_grid with facet_wrap(~continent). What happens?
- To see more facetting examples, see the documentation.
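The difference between the two facetting functions can be sketched on a dataset other than the ones in the tasks. The example below assumes the mpg dataset that ships with ggplot2 (it is not otherwise used in this worksheet): facet_wrap lays out the panels for one variable in rows and columns, while facet_grid crosses two variables, one for rows and one for columns.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

# One panel per class of car, wrapped onto several rows
p_wrap <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)

# Rows split by drive type (drv), columns by number of cylinders (cyl)
p_grid <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```

In the grid version some panels are empty, because not every combination of drive type and cylinders occurs in the data. This is one reason to prefer facet_wrap when there is only a single grouping variable.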
Summary of this section
- Information density is good, but can harm clarity
- Using facets can help us make comparisons between groups in the data
- It can be helpful when adding another aesthetic would clutter our plot (e.g. we have too many groups to use a different colour for each)
Scales
As we’ve already seen, plots can include multiple dimensions, and these dimensions can be displayed using position (on the x/y axes) or colour, size etc.
ggplot does a really good job of picking good defaults when it converts the numbers in your dataset to positions, colours or other visual features of plots. However in some cases it is useful to know that you can change the default scales used.
Continuous vs. categorical scales
So far, when we have used colour to display information we have always used categorical variables. This means the colour scales in our plots have looked something like this:
For example, this plot shows the relationship between life expectancy and GDP in 2002, coloured by continent (using the gapminder data):
gapminder::gapminder %>%
  filter(year==2002) %>%
  ggplot(aes(gdpPercap, lifeExp, colour=continent)) +
  geom_point()
However in other cases we might want to use colour to display a continuous variable. If we want to plot continuous data in colour, we need a scale like this:
A continuous colour scale
In the plot below the x and y axes show the relationship between fuel economy (mpg) and weight (wt, recorded in 1000s of lbs). Colours are used to add information about how powerful (hp, short for horsepower) each car was:
mtcars %>%
  ggplot(aes(wt, mpg, color=hp)) +
  geom_point()
Did high-powered cars tend to have good or poor fuel economy?
Poor, sadly! The most powerful cars are mostly the heavier ones, with the lowest mpg.
Categorical, continuous and ‘other’ variables
Sometimes variables can be stored in the ‘wrong’ format in R.
To give one example, the mtcars dataset contains a column called am, which indicates if a car had an automatic or manual transmission. The variable is coded as either 0 (=automatic transmission) or 1 (=manual).
If we use the am variable for the colour aesthetic of a plot you will notice that ggplot wrongly uses a continuous colour scale, suggesting that there are values between 0 and 1:
mtcars %>%
  ggplot(aes(wt, mpg, color=am)) +
  geom_point()
To fix this, and draw am as a categorical variable, we can use the factor command:
mtcars %>%
  ggplot(aes(wt, mpg, color=factor(am))) +
  geom_point()
Explanation: We replaced color=am in the previous plot with color=factor(am). The factor command forces R to plot am as a categorical variable. This means we now see only two distinct colours in the plot for values of 0 and 1, rather than a gradation for values between 0 and 1.
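One possible refinement, sketched below, is to give the two categories readable names at the same time as converting them. The column name transmission and the labels are choices made for this example (they are not part of the mtcars data); the coding of 0 as automatic and 1 as manual follows the description above.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

# Create a labelled categorical copy of am; 'transmission' is a new,
# illustrative column name
cars_labelled <- mtcars %>%
  mutate(transmission = factor(am,
                               levels = c(0, 1),
                               labels = c("Automatic", "Manual")))

cars_labelled %>%
  ggplot(aes(wt, mpg, colour = transmission)) +
  geom_point()
```

The legend now shows Automatic and Manual rather than 0 and 1, and its title is the new column name rather than factor(am).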
The mtcars dataset contains another variable, cyl, which records how many cylinders each car had.
- Create a scatterplot of mpg and wt, with cyl as the colour aesthetic, treated as a categorical variable.
- Repeat this, but now use cyl as a continuous or numeric variable.
- Do the same again, but using a facet rather than the colour aesthetic.
Optional: If you want to, do the section called ‘Logarithmic scales’ in the extensions worksheet.
Summary of this section
- Colour scales can be either categorical or continuous
- Sometimes data are stored in the ‘wrong’ format. We can use factor(<VAR>) to force a variable to be categorical
And if you did the extension exercises:
- Logarithmic (log) scales create uneven spacing on the axes.
- Log scales can be useful when data have a skewed distribution, but we need to be careful when interpreting them
Comparing categories
In the examples above we have been plotting continuous variables (and adding colours etc). We’ve used density, scatter and smoothed line plots to do this.
Another common requirement is to use plots to compare summary statistics for different groups or categories. For example, the classic plot in a psychology study looks like this:
However, there is evidence that readers often misinterpret bar plots. Specifically, the problem is that we perceive values within the bar area as more likely than those just above, even though this is not in fact the case.
A better choice is (almost always) to use a boxplot:
expdata %>%
  ggplot(aes(x=Condition, y=RT)) + geom_boxplot()
Explanation: We used Condition, a category, as our x axis, and reaction times (RT) as the y axis. We added geom_boxplot to show a boxplot.
If you’re not familiar with boxplots, there are more details in the help files (type ?geom_boxplot into the console) or on the Wikipedia page on box plots.
Load the (simulated) dataset called expdata.csv.
Either download the file and upload to your Rstudio project directory, or read it directly from this url: https://gist.githubusercontent.com/benwhalley/f94baf447612e2434b181739dbba27df/raw/43df26022fff68f49918c795f27d7352dc0d3425/expdata.csv
- Recreate the boxplot above
- Adjust it so that stimuli is the variable on the x-axis
- Use a facet to recreate the plot you saw above, combining both Condition and Stimuli
Other data summary layers
If you really need to plot the mean and standard error of different categories, ggplot has the stat_summary command:
expdata %>%
  ggplot(aes(Condition, RT)) + stat_summary()
Explanation: We used Condition and RT as our x and y axes, as before. This time we added stat_summary() instead of geom_boxplot(). By default this plots the mean and standard error (a measure of variability) in each group, using a point-range plot. This is better than a bar chart because it avoids a known bias in how we read them. You can ignore the message about No summary function supplied, defaulting to mean_se() for now.
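To see what stat_summary is doing, the sketch below computes the same means and standard errors by hand and compares them with the plotted values. It uses the built-in mtcars data rather than expdata (an assumption, so that the example runs without downloading anything), and it passes fun.data = mean_se explicitly, which is the default the message refers to.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

# Mean and standard error of mpg for each number of cylinders, by hand
summary_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            se_mpg = sd(mpg) / sqrt(n()))

# The same summary drawn as a point-range plot; naming the summary
# function avoids the 'No summary function supplied' message
p_summary <- mtcars %>%
  ggplot(aes(factor(cyl), mpg)) +
  stat_summary(fun.data = mean_se)

summary_by_cyl
p_summary
```

The point in each group sits at mean_mpg, and the line runs from one standard error below the mean to one standard error above it.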
As an extension exercise:
- Adapt your facetted boxplot from above to show the mean and standard error instead
- Can you combine both boxplot and summary in a single plot?
Spit and polish
ggplot is great because it sets sensible defaults for most things (axes, colours etc). When you are exploring your data these defaults typically suffice. However for publication you will often need to polish up your plots, perhaps including:
- Labelling your plot axes
- Adding lines or text
- Changing plot colours etc
- Saving to a pdf or other output format
Labelling axes
By default, ggplot uses variable names and the values in your data to label plots. Sometimes these are abbreviations, or otherwise need changing.
To relabel axes we simply add + xlab("TEXT") or + ylab("TEXT") to an existing plot:
mtcars %>% ggplot(aes(wt, mpg)) +
  geom_point() +
  xlab("Weight (1000s of lbs)") +
  ylab("Fuel economy (miles per gallon)")
Try adding axis labels to one of your existing plots.
Changing the label of color/shape guides (legends)
If you are short of time you can treat the rest of this section like an extension exercise. It might be useful for your own work, but won’t form part of the assessment.
When adding the colour aesthetic, ggplot uses the variable name to label the plot legend. For example:
mtcars %>%
  ggplot(aes(wt, mpg, colour=factor(cyl))) +
  geom_point()
The generated legend label sometimes looks ugly (like above) but this is easy to fix:
mtcars %>%
  ggplot(aes(wt, mpg, colour=factor(cyl))) +
  geom_point() +
  labs(color="Cylinders")
Explanation: We added labs(color="Cylinders") to the plot to change the legend label.
Try relabelling the colour legend of one of your existing plots.
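The labs function can also set the axis labels and a title, so one call can replace separate xlab and ylab commands. The following is a small sketch along those lines; the title text is made up for this example.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

p_labelled <- mtcars %>%
  ggplot(aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  # One labs() call sets the axis labels, the legend label and a title
  labs(x = "Weight (1000s of lbs)",
       y = "Fuel economy (miles per gallon)",
       colour = "Cylinders",
       title = "Fuel economy by weight")  # illustrative title

p_labelled
```

Whether you use xlab and ylab or a single labs call is a matter of taste; the resulting plot is the same.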
Adding lines
Sometimes it can be helpful to add lines to a plot: for example to show a clinically meaningful cutoff, or the mean of a sample.
For example, let’s say we want to make a density plot of income in the cps2 data, adding a line showing the median income. First we calculate the median:
cps2 <- read_csv("data/cps2.csv")
median_income <- cps2 %>% summarise(median(income)) %>% pull(1)
Explanation: First, we define a new variable equal to the median income in the sample. We do this by using summarise(median(income)). The part which reads pull(1) says “take the first column”. We need to do this because summarise() creates a new table, rather than a single value or sequence of values (which we need below).
cps2 %>%
  filter(income < 150000) %>%
  ggplot(aes(income, y=after_stat(scaled))) +  # after_stat(scaled) replaces the older ..scaled.. notation (ggplot2 3.3.0 or later)
  geom_density() +
  geom_vline(xintercept = median_income, color="red")
Explanation: We have a regular density plot. This time we have added geom_vline which draws a vertical line. The xintercept is the place on the x axis where our line should cross.
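The same idea can be tried without the cps2 file. The sketch below uses the built-in mtcars data (an assumption for this example) and computes the median directly with median(), which returns a single number and so does not need pull(). The dashed linetype is only there to show that lines accept the usual aesthetics.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

# median() on a single column returns one number, ready to use as xintercept
median_mpg <- median(mtcars$mpg)

p_vline <- mtcars %>%
  ggplot(aes(mpg)) +
  geom_density() +
  geom_vline(xintercept = median_mpg, colour = "red", linetype = "dashed")

p_vline
```

For a horizontal line, geom_hline(yintercept = ...) works in the same way.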
Add a geom_vline to a plot you have already created. This could be either:
- A calculated value (e.g. mean(var)) or
- A fixed value (e.g. xintercept = 20)
Saving plots to a file
So far we have created plots in the RStudio web interface. This is fine when working interactively, but sometimes you will need to send a high-quality plot to someone (perhaps a journal).
The ggsave function lets us do this.
The first step is to make a plot, and save it (give it a name).
myfunkyplot <- mtcars %>% ggplot(aes(wt, mpg, color=factor(cyl))) + geom_point()
Explanation: We used the assignment operator <- to save our plot to a new name (myfunkyplot). This means that when we run the code RStudio won’t generate any output immediately, so we don’t see the plot yet.
Next, we use ggsave to save the plot to a particular file:
ggsave('myfunkyplot.pdf', myfunkyplot, width=8, height=4)
You can see the output of the ggsave command by downloading the file from the files directory of your RStudio window. It should end up in the same place as your R Script, provided you have created a project (and you should always create a project).
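By default ggsave measures width and height in inches. If a journal specifies sizes in centimetres, the units argument can be used instead. The sketch below is an illustration: it recreates the plot so that it runs on its own, and it writes to a temporary folder (an assumption made so the example does not clutter your project) rather than to the project directory.

```r
library(ggplot2)  # both packages are also loaded by library(tidyverse)
library(dplyr)

myfunkyplot <- mtcars %>%
  ggplot(aes(wt, mpg, color = factor(cyl))) +
  geom_point()

# tempdir() is used only for this example; in your own work give a path
# inside your project, e.g. "myfunkyplot.pdf"
out_file <- file.path(tempdir(), "myfunkyplot_cm.pdf")

# width and height are now in centimetres rather than inches
ggsave(out_file, myfunkyplot, width = 16, height = 8, units = "cm")

file.exists(out_file)
```

ggsave chooses the output format from the file extension, so changing .pdf to another supported extension saves the plot in that format instead.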
All content on this site is distributed under a Creative Commons licence. CC-BY-SA 4.0.